
feat: add metric pgrst_jwt_cache_size in admin server #3802

Draft · wants to merge 1 commit into main from metric/jwt-cache-size
Conversation

@taimoorzaeem (Collaborator) commented Nov 26, 2024

Add the metric pgrst_jwt_cache_size to the admin server, which shows the cache size in bytes.

@steve-chavez (Member)

Don't forget the feedback about the actual size in bytes #3801 (comment)

Ultimately, I believe we're going to need an LRU cache for the JWT cache to be production ready. So having this metric in bytes will be useful now and later.

@taimoorzaeem (Collaborator, Author)

Don't forget the feedback about the actual size in bytes #3801 (comment)

Calculating "actual" byte size of cache is sort of tricky (haskell laziness is a lot to deal with sometimes) and I still haven't figured it out YET. In the meantime, I have written some code to approximate the cache size in bytes.

It works as follows:

Data.Cache gives a function toList which returns the cache entries as a list of tuples ([(ByteString, AuthResult, Maybe TimeSpec)] in our case).

Now, we can use the ghc-datasize library to calculate the byte size of the ByteString and AuthResult, but not the Maybe TimeSpec (recursiveSizeNF only works on types that are an instance of the NFData typeclass), hence I am calling it an "approximation".
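
A minimal sketch of that approximation (assuming PostgREST's AuthResult type has the NFData instance implied above; the helper name is illustrative, not code from the PR):

import           Control.Monad (forM)
import           Data.ByteString (ByteString)
import qualified Data.Cache as C
import           GHC.DataSize (recursiveSizeNF)

-- Sum the sizes of the key and value of every cache entry; the
-- Maybe TimeSpec field is skipped because TimeSpec has no NFData instance.
approxJwtCacheSizeInBytes :: C.Cache ByteString AuthResult -> IO Word
approxJwtCacheSizeInBytes cache = do
  entries <- C.toList cache
  sizes   <- forM entries $ \(token, result, _timeSpec) -> do
    tokenSize  <- recursiveSizeNF token
    resultSize <- recursiveSizeNF result
    pure (tokenSize + resultSize)
  pure (sum sizes)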

@taimoorzaeem force-pushed the metric/jwt-cache-size branch 2 times, most recently from 1254ed5 to 11727ab on December 17, 2024 06:50
@steve-chavez (Member)

This is pretty cool, so I do see the size starting at 0 then increasing as I do requests:

# $ nix-shell
# $ PGRST_ADMIN_SERVER_PORT=3001  PGRST_JWT_CACHE_MAX_LIFETIME=30000 postgrest-with-postgresql-16  -f test/spec/fixtures/load.sql postgrest-run

$ curl localhost:3001/metrics
# HELP pgrst_jwt_cache_size The number of cached JWTs
# TYPE pgrst_jwt_cache_size gauge
pgrst_jwt_cache_size 0.0

$ curl localhost:3000/authors_only -H "Authorization: Bearer $(postgrest-gen-jwt --exp 10 postgrest_test_author)"
[]
$ curl localhost:3001/metrics
..
pgrst_jwt_cache_size 72.0

$ curl localhost:3000/authors_only -H "Authorization: Bearer $(postgrest-gen-jwt --exp 10 postgrest_test_author)"
[]
$ curl localhost:3001/metrics
..
pgrst_jwt_cache_size 144.0

$ curl localhost:3000/authors_only -H "Authorization: Bearer $(postgrest-gen-jwt --exp 10 postgrest_test_author)"
[]
$ curl localhost:3001/metrics
..
pgrst_jwt_cache_size 216.0

Of course this doesn't drop down after a while because we need #3801 for that.

One issue that I've noticed is that we're printing empty log lines for each request:

[nix-shell:~/Projects/postgrest]$ PGRST_ADMIN_SERVER_PORT=3001  PGRST_JWT_CACHE_MAX_LIFETIME=30000 postgrest-with-postgresql-16  -f test/spec/fixtures/load.sql postgrest-run
...
18/Jan/2025:18:16:32 -0500: Schema cache loaded in 17.1 milliseconds
18/Jan/2025:18:16:34 -0500: 
18/Jan/2025:18:16:38 -0500: 
18/Jan/2025:18:16:42 -0500: 

This is due to the addition of the new Observation and the following line here:

This is surprising behavior; I'll try to refactor this in a new PR.

For now, how about printing a message like:

JWTCache sz-> "The JWT Cache size increased to " <> sz <> "bytes"

This should only happen for a log level of LogDebug or higher; check how this is done in:

observationLogger :: LoggerState -> LogLevel -> ObservationHandler
observationLogger loggerState logLevel obs = case obs of
  o@(PoolAcqTimeoutObs _) -> do
    when (logLevel >= LogError) $ do
      logWithDebounce loggerState $
        logWithZTime loggerState $ observationMessage o
  o@(QueryErrorCodeHighObs _) -> do
    when (logLevel >= LogError) $ do
      logWithZTime loggerState $ observationMessage o
  o@(HasqlPoolObs _) -> do
    when (logLevel >= LogDebug) $ do
      logWithZTime loggerState $ observationMessage o
  PoolRequest ->
    pure ()
  PoolRequestFullfilled ->
    pure ()
  o ->
    logWithZTime loggerState $ observationMessage o
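
A sketch of how the suggested case could slot into that handler (the JWTCache constructor follows the suggestion above; this is not code from the PR):

  o@(JWTCache _) -> do
    when (logLevel >= LogDebug) $ do
      logWithZTime loggerState $ observationMessage o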

@steve-chavez (Member)

Calculating the "actual" byte size of the cache is sort of tricky (Haskell laziness is a lot to deal with sometimes) and I still haven't figured it out YET. In the meantime, I have written some code to approximate the cache size in bytes.

The above is understandable. What we really need is a good-enough approximation, so we have an order-of-magnitude understanding of whether the cache size is in KB, MB, or GB. So far this seems enough. We should definitely document why we do an approximation, though.

src/PostgREST/App.hs (outdated review thread, resolved)
@steve-chavez (Member)

I've just noticed that perf badly drops down on this PR:

param        v12.2.3   head   main
throughput   448       121    399

https://github.com/PostgREST/postgrest/pull/3802/checks?check_run_id=34517395528

The documentation for recursiveSize says:

This function works very quickly on small data structures, but can be slow on large and complex ones. If speed is an issue it's probably possible to get the exact size of a small portion of the data structure and then estimate the total size from that.

@taimoorzaeem What could we do to avoid this drop? Maybe calculate the cache size periodically on a background thread? Any thoughts?
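
As a sketch, the background-thread idea could look something like this (the interval and helper names are placeholders, not a decided design):

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever, void)

-- Measure the cache size off the hot path and report it to the metric,
-- so per-request latency is unaffected by the (possibly slow) traversal.
startSizeSampler :: IO Word -> (Word -> IO ()) -> IO ()
startSizeSampler measure report = void . forkIO . forever $ do
  measure >>= report
  threadDelay (30 * 1000000)  -- sample every 30 seconds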

src/PostgREST/Auth.hs (two outdated review threads, resolved)
@taimoorzaeem force-pushed the metric/jwt-cache-size branch from 11727ab to 145c236 on January 19, 2025 14:41
@taimoorzaeem (Collaborator, Author)

Calculating the "actual" byte size of the cache is sort of tricky (Haskell laziness is a lot to deal with sometimes) and I still haven't figured it out YET. In the meantime, I have written some code to approximate the cache size in bytes.

The above is understandable. What we really need is a good-enough approximation, so we have an order-of-magnitude understanding of whether the cache size is in KB, MB, or GB. So far this seems enough. We should definitely document why we do an approximation, though.

Now that I have gotten better at Haskell 🚀, I solved the issue and we don't need an "approximation". We CAN calculate the full cache size in bytes.

@taimoorzaeem force-pushed the metric/jwt-cache-size branch 3 times, most recently from 4b1afd1 to d9de43a on January 19, 2025 18:11
@taimoorzaeem (Collaborator, Author) commented Jan 19, 2025

The load test and memory test are failing with:

src/PostgREST/Auth.hs:44:1: error:
    Could not find module ‘GHC.DataSize’
    Perhaps you haven't installed the profiling libraries for package ‘ghc-datasize-0.2.7’?
    Use -v (or `:set -v` in ghci) to see a list of the files searched for.
   |
44 | import GHC.DataSize            (recursiveSizeNF)

Is there any additional configuration that needs to be added for profiling?

Comment on lines 53 to 55 (Member):

# nixpkgs have ghc-datasize-0.2.7 marked as broken
ghc-datasize = lib.markUnbroken prev.ghc-datasize;

Suggested change:

-# nixpkgs have ghc-datasize-0.2.7 marked as broken
-ghc-datasize = lib.markUnbroken prev.ghc-datasize;
+# TODO: Remove this once https://github.com/NixOS/nixpkgs/pull/375121
+# has made it to us.
+ghc-datasize = lib.markUnbroken prev.ghc-datasize;

@wolfgangwalther (Member)

Load test and memory test failing with:

Does it happen locally, too?

I am confused; I can't exactly spot what's wrong right now.

@taimoorzaeem (Collaborator, Author)

Does it happen locally, too?

I am confused; I can't exactly spot what's wrong right now.

Yes, it does happen locally.

[nix-shell]$ postgrest-loadtest
...
src/PostgREST/Auth.hs:44:1: error:
    Could not find module ‘GHC.DataSize’
    Perhaps you haven't installed the profiling libraries for package ‘ghc-datasize-0.2.7’?
    Use -v (or `:set -v` in ghci) to see a list of the files searched for.
   |
44 | import GHC.DataSize            (recursiveSizeNF)
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@taimoorzaeem force-pushed the metric/jwt-cache-size branch 2 times, most recently from 7ff02db to 6eb7d3c on January 20, 2025 17:19
@taimoorzaeem (Collaborator, Author)

I need some way to run postgrest-loadtest on the CI. Currently it is failing because of building with Nix. Running PGRST_BUILD_CABAL=1 postgrest-loadtest works locally, but I am not sure how to set this up on CI temporarily to check the loadtest results. Is running the loadtest on CI equivalent to running it locally? Would I get the same results?

@wolfgangwalther (Member) commented Jan 20, 2025

Yeah, running the loadtest in CI does not work when dependencies are changed, because it needs to run against the base branch, which doesn't have the dependencies yet... and then cabal just breaks it somehow.

You can run something like this locally to get the same markdown output:

postgrest-loadtest-against main v12.2.5
postgrest-loadtest-report

(but it's likely that it fails the same way... :D)

@wolfgangwalther (Member)

Perhaps you haven't installed the profiling libraries for package ‘ghc-datasize-0.2.7’?

The thing I don't understand about this error message is that it appears in contexts where we don't use profiling libs. We do for the memory test, so there the error message makes "kind of sense" (I still don't understand why it happens, though). But the loadtest and the regular build on macOS... those don't use profiling.

Hm... actually, we don't do a regular dynamically linked Linux build via Nix, I think. So in fact it fails for every Nix build except the static build. Still don't know what's happening, though.

@steve-chavez (Member)

[nix-shell]$ postgrest-loadtest
...
src/PostgREST/Auth.hs:44:1: error:

I get a similar error when trying this locally too.

Yeah, running the loadtest in CI does not work when dependencies are changed, because it needs to run against the base branch, which doesn't have the dependencies yet... and then cabal just breaks it somehow.

To unblock the PR, how about only adding the dependency in another PR and then merging it? Then I assume this PR would run the loadtest?

@wolfgangwalther (Member)

To unblock the PR, how about only adding the dependency in another PR and then merging it? Then I assume this PR would run the loadtest?

No, I don't think so. It seems the dependency issue was fixed a while ago in 0c5d2e5. Also, the fact that all Nix builds fail (the memory test, on Darwin, etc.) indicates that there is something else going on.

@taimoorzaeem marked this pull request as ready for review on January 27, 2025 06:48
@taimoorzaeem marked this pull request as draft on January 27, 2025 16:52
@taimoorzaeem force-pushed the metric/jwt-cache-size branch 2 times, most recently from b5331af to de80a9e on January 27, 2025 17:32
@taimoorzaeem marked this pull request as ready for review on January 27, 2025 17:34
@taimoorzaeem force-pushed the metric/jwt-cache-size branch from de80a9e to 7034d38 on January 27, 2025 18:00
@steve-chavez (Member) left a comment

Great work 🚀

The codecov failures look related to Internal.hs, which should go once #3881 is solved.

src/PostgREST/Auth.hs (outdated review thread, resolved)
Comment on lines 186 to 192:

   -- if token not found, add to cache and increment cache size metric
   case (authResult, checkCache) of
-    (Right res, Nothing) -> C.insert' (getJwtCache appState) (getTimeSpec res maxLifetime utc) token res
+    (Right res, Nothing) -> do
+      let tSpec = getTimeSpec res maxLifetime utc
+      C.insert' jwtCache (Just tSpec) token res
+      entrySize <- calcCacheEntrySizeInBytes (token, res, tSpec)
+      observer $ JWTCache entrySize -- adds size to metric
@steve-chavez (Member) commented Jan 27, 2025

I still don't understand how the incremental calculation is done. Where is the JWT cache size stored and then incremented? My thinking is that the cache size should be stored somewhere in AppState, but there's nothing added there.

@taimoorzaeem (Collaborator, Author)

Wait, I think this PR isn't quite ready yet. I am now thinking of a better design. I'll take another look tomorrow.

@taimoorzaeem marked this pull request as draft on January 27, 2025 21:53
@taimoorzaeem (Collaborator, Author) commented Jan 28, 2025

@steve-chavez I think we need a design decision here. We are currently allowing our in-memory JWT cache to grow without bound. I think this is bad because it will cause memory exhaustion and impact performance. We also don't need the explicit expiration support that Data.Cache gives us, because we already have the exp claim for that.

Switch to LRU Cache

As you said earlier, we need an LRU cache for caching to be production-ready, so how about we switch our cache implementation to https://hackage.haskell.org/package/lrucache. This would also mean that we need a potentially configurable maximum cache size (add a new config?). WDYT?
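
For illustration, a minimal sketch of the switch using Data.Cache.LRU.IO from lrucache (the entry bound is a placeholder for the proposed config, and AuthResult is the existing PostgREST type):

import           Data.ByteString (ByteString)
import qualified Data.Cache.LRU.IO as LRU

-- A bounded, thread-safe cache: once the maximum entry count is reached,
-- inserting a new JWT evicts the least-recently-used one.
newJwtCache :: IO (LRU.AtomicLRU ByteString AuthResult)
newJwtCache = LRU.newAtomicLRU (Just 2000)  -- placeholder bound; would come from the new config

cacheJwt :: LRU.AtomicLRU ByteString AuthResult -> ByteString -> AuthResult -> IO ()
cacheJwt cache token res = LRU.insert token res cache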

@steve-chavez (Member)

@taimoorzaeem Yes, that is the end goal. However, wouldn't that imply a breaking change? It would require a new major version.

Since the current JWT cache is broken (#3788), we should release a new minor with the fix and then do the LRU cache for the new major.

@wolfgangwalther (Member)

we should release a new minor with the fix

We can't release a new minor version. We can only release a new patch version on the v12 branch or a new major on the main branch. We still have a few breaking changes that we've been carrying for quite a while on main (removal of PostgreSQL), and we have never made a release for v13.

@steve-chavez (Member)

How should we handle this in v12 then? Maybe put a "danger" note on https://postgrest.org/en/v12/references/auth.html#jwt-caching, saying that the feature is broken and should only be used on v13?

@wolfgangwalther (Member)

How should we handle this in v12 then? Maybe put a "danger" note on https://postgrest.org/en/v12/references/auth.html#jwt-caching, saying that the feature is broken and should only be used on v13?

Can we not run purgeExpired periodically? Why do we need this feature for it?
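
For reference, a periodic purge could be as small as this sketch against the current Data.Cache API (the interval and the wiring are assumptions, not code from the PR):

import           Control.Concurrent (forkIO, threadDelay)
import           Control.Monad (forever, void)
import           Data.Hashable (Hashable)
import qualified Data.Cache as C

-- Evict expired entries on a timer instead of waiting for lookups;
-- the 60-second interval is an arbitrary placeholder.
startPurgeLoop :: (Eq k, Hashable k) => C.Cache k v -> IO ()
startPurgeLoop cache = void . forkIO . forever $ do
  threadDelay (60 * 1000000)
  C.purgeExpired cache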

@steve-chavez (Member) commented Jan 28, 2025

Can we not run purgeExpired periodically? Why do we need this feature for it?

There's no way to test that the fix is working correctly without this. Check https://github.com/PostgREST/postgrest/pull/3801/files#r1855025853. There, the feature was added in the same PR as the fix.

We can't release a new minor version. We can only release a new patch version on the v12 branch or a new major on the main branch. We still have a few breaking changes that we've been carrying for quite a while on main (removal of PostgreSQL), and we have never made a release for v13.

Why is the above the case, btw? We could just not pick the breaking changes for the minor.

@wolfgangwalther (Member)

Can we not run purgeExpired periodically?

There's no way to test that the fix is working correctly without this.

If we can use this feature on the main branch to test that a separate commit with the fix is working, we might just backport the fix without the test.

Why is the above the case, btw? We could just not pick the breaking changes for the minor.

Partly because that's the workflow we agreed on. We are backporting fixes, not features. But there's reason behind it:

  • Once you start backporting features, you are going to backport a lot more code. Does this feature apply cleanly on v12? Do we need to backport other features first? Why do we release a minor with some features, but not with others? The complexity just increases incredibly.
  • The risk increases. Every new feature increases the risk of introducing new bugs. We agreed to backport fixes to enable us to do two things at the same time: iterate quickly with new stuff on the main branch, but also stabilize the currently stable release. We were counteracting a pattern we had back then: after a new minor/major, we always had to quickly ship some bugfixes, but if our main branch had already advanced past the next patch release, we were not able to anymore.

I fully agree we need to fix the bug. But we should not introduce new risks by doing so.

@steve-chavez (Member)

Partly because that's the workflow we agreed on. We are backporting fixes, not features. But there's reason behind it:

Agree, thanks for the reminder.

If we can use this feature on the main branch to test that a separate commit with the fix is working, we might just backport the fix without the test.

Sounds good. I'll just cherry-pick this PR commit locally while reviewing #3801 then.

@taimoorzaeem I suggest keeping the LRU cache improvement after #3801 is merged. We should clear that first.

@taimoorzaeem (Collaborator, Author)

@taimoorzaeem I suggest keeping the LRU cache improvement after #3801 is merged. We should clear that first.

Sure, sounds good! 👍
